Synopsis: Ambiguous Groups

Let's learn how using ambiguous groups causes an antipattern.

Suppose that your boss needs to know which projects in the bugs database are still active and which projects have been abandoned. One report they ask you to generate is the latest bug reported per product. You write a query using the MySQL database to calculate the greatest value in the date_reported column per group of bugs sharing a given product_id. The report looks like this:

product_name latest bug_id
Open RoundFile 2010-06-01 1234
Visual TurboBuilder 2010-02-16 3456
ReConsider 2010-01-01 5678

Your boss is detail-oriented, and they spend some time looking up each bug listed in the report. They notice that the row listed as the most recent for “Open RoundFile” shows a bug_id that isn’t the latest bug. The full data shows the discrepancy:

product_name date_reported bug_id
Open RoundFile 2009-12-19 1234
Open RoundFile 2010-06-01 2248
Visual TurboBuilder 2010-02-16 3456
Visual TurboBuilder 2010-02-10 4077
Visual TurboBuilder 2010-02-16 5150
ReConsider 2010-01-01 5678
ReConsider 2009-11-09 8063

The bug_id doesn’t match the date in the date_reported column.

How can you explain this problem? Why does it affect one product but not the others? How can you get the desired report?

Objective: Get a row with the greatest value per group#

Most programmers who learn SQL get to the stage of using GROUP BY in a query, applying some aggregate function to groups of rows, and getting a result with one row per group. This powerful feature makes it easy to get a wide variety of complex reports using relatively little code.

For example, a query to get the latest bug reported for each product in the bugs database looks like this:

Querying for the latest bug reported for each product

A natural extension to this query is to request the ID of the specific bug with the latest date reported:

Querying for the latest bug reported for each product along with the ID

However, this query results in either an error or an unreliable answer. This is a common source of confusion for programmers using SQL.

The objective is to run a query that reports the greatest value in a group and includes other row attributes where that value is present.

Legitimate uses of the antipattern#

The Single-Value Rule is that every column in the select list of a query must have a single value row per row group. MySQL and SQLite can’t guarantee a reliable result for a column that doesn’t fit the Single-Value Rule. We will study this concept in detail in the next lesson. There are cases when we can take advantage of the fact that these databases enforce the rule less strictly than other brands.

A query that violates Single-Value Rule

The account_name column technically violates the Single-Value Rule in the previous query since it’s named neither in the GROUP BY clause nor in an aggregate function. Nevertheless, there is only one value possible for account_name in each group. The groups are based on Bugs.reported_by, which is a foreign key to the Accounts table. Therefore, the groups correspond one-to-one with rows in the Accounts table.

Functional dependency#

If we know the value of reported_by, then you know the value of account_name unambiguously, as if we had queried by the primary key of the Accounts table.

This kind of unambiguous relationship is called functional dependency. The most common example of this is between the primary key of a table and the table’s attributes: account_name is a functional dependency of its primary key, account_id. If we group a query by a table’s primary key column(s), then the groups correspond to a single row of that table, and therefore all other columns of the same table must have a single value per group.

The Bugs.reported_by attribute has a similar relationship with the dependent attributes of the Accounts table because it references the primary key of the Accounts table. When the query groups items by the reported_by column, which is a foreign key, the attributes of the Accounts table are functionally dependent, and the query result contains no ambiguity.

However, most brands of databases still return an error. Not only is this the return behavior required by the SQL standard, it’s also not too expensive to figure out functional dependencies on the fly. But if we use MySQL or SQLite and we’re careful to query only functionally dependent columns, we can use this kind of grouping queries and still avoid problems of ambiguity.

Solution: Use NULL as a Unique Value
Antipattern: Reference Non-grouped Columns
Mark as Completed
Report an Issue